Distributed Document Clustering Using K-Means
نویسنده
چکیده
Document clustering, one of the traditional data mining techniques, is an unsupervised learning paradigm where clustering methods try to identify inherent grouping of the text documents.The importance of document clustering emerges from the massive volumes of textual documents created. Also, with more and more development of information technology, data set in many domains is reaching beyond peta-scale; making it difficult to work with the document clustering algorithms in central site and leading to the need of increasing the computational requirements. The concept of distributed computing thus; is explored for document clustering giving rise to distributed document clustering. Here, we propose distributed document clustering usingHadoop and MapReduce. We implemented Kmeans and tested on single node and then modified the map, reduce functions to run over cluster of three machines. We tested on two datasets consisting of 20000 documents (20-NewsGroups) and 21578 documents (Reuters-21578). The results show that timing requirement for clustering reduces with addition of nodes in the cluster.
منابع مشابه
Comparing k-means clusters on parallel Persian-English corpus
This paper compares clusters of aligned Persian and English texts obtained from k-means method. Text clustering has many applications in various fields of natural language processing. So far, much English documents clustering research has been accomplished. Now this question arises, are the results of them extendable to other languages? Since the goal of document clustering is grouping of docum...
متن کاملAn efficient hybrid distributed document clustering algorithm
Recent advances in information technology have led to an increase in volumes of data thereby exceeding beyond petabytes. Clustering distributed document sets from a central location is difficult due to the massive demand of computational resources. So there is a need for distributed document clustering algorithms to cluster documents using distributed resources. The greatest challenge in this a...
متن کاملA Hybrid Data Clustering Algorithm Using Modified Krill Herd Algorithm and K-MEANS
Data clustering is the process of partitioning a set of data objects into meaning clusters or groups. Due to the vast usage of clustering algorithms in many fields, a lot of research is still going on to find the best and efficient clustering algorithm. K-means is simple and easy to implement, but it suffers from initialization of cluster center and hence trapped in local optimum. In this paper...
متن کاملOntology Based Document Clustering Using MapReduce
Nowadays, document clustering is considered as a data intensive task due to the dramatic, fast increase in the number of available documents. Nevertheless, the features that represent those documents are also too large. The most common method for representing documents is the vector space model, which represents document features as a bag of words and does not represent semantic relations betwe...
متن کاملDesign and Implement of Distributed Document Clustering Based on MapReduce
In this paper, we describe how document clustering for large collection can be efficiently implemented with MapReduce. Hadoop implementation provides a convenient and flexible framework for distributed computing on a cluster of commodity machines. The design and implementation of tfidf and K-Means algorithm on MapReduce is presented. More importantly, we improved the efficiency and effectivenes...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2014